Shape Matching and Object Recognition Using Shape Contexts

Abstract — We present a novel approach to measuring similarity between shapes and exploit it for object recognition. In our
framework,  the  measurement  of  similarity  is  preceded  by  1)  solving  for  correspondences  between  points  on  the  two  shapes,  2)  using 
the  correspondences  to  estimate  an  aligning  transform.  In  order  to  solve  the  correspondence  problem,  we  attach  a  descriptor,  the 
shape  context,  to  each  point.  The  shape  context  at  a  reference  point  captures  the  distribution  of  the  remaining  points  relative  to  it,  thus 
offering  a  globally  discriminative  characterization.  Corresponding  points  on  two  similar  shapes  will  have  similar  shape  contexts, 
enabling  us  to  solve  for  correspondences  as  an  optimal  assignment  problem.  Given  the  point  correspondences,  we  estimate  the 
transformation  that  best  aligns  the  two  shapes;  regularized  thin-plate  splines  provide  a  flexible  class  of  transformation  maps  for  this 
purpose.  The  dissimilarity  between  the  two  shapes  is  computed  as  a  sum  of  matching  errors  between  corresponding  points,  together 
with  a  term  measuring  the  magnitude  of  the  aligning  transform.  We  treat  recognition  in  a  nearest-neighbor  classification  framework  as 
the  problem  of  finding  the  stored  prototype  shape  that  is  maximally  similar  to  that  in  the  image.  Results  are  presented  for  silhouettes, 
trademarks,  handwritten  digits,  and  the  COIL  data  set. 

Index  Terms — Shape,  object  recognition,  digit  recognition,  correspondence  problem,  MPEG7,  image  registration,  deformable 
templates.

1 Introduction

Consider the two handwritten digits in Fig. 1. Regarded
as vectors of pixel brightness values and compared
using $L_2$ norms, they are very different. However, regarded
as  shapes  they  appear  rather  similar  to  a  human  observer. 
Our  objective  in  this  paper  is  to  operationalize  this  notion  of 
shape  similarity,  with  the  ultimate  goal  of  using  it  as  a  basis 
for  category-level  recognition.  We  approach  this  as  a  three- 
stage  process: 

1. solve the correspondence problem between the two
shapes,

2. use the correspondences to estimate an aligning
transform, and

3. compute the distance between the two shapes as a
sum of matching errors between corresponding
points, together with a term measuring the magnitude
of the aligning transformation.

At  the  heart  of  our  approach  is  a  tradition  of  matching 
shapes  by  deformation  that  can  be  traced  at  least  as  far  back 
as D'Arcy Thompson. In his classic work, On Growth and
Form [55], Thompson observed that related but not identical
shapes can often be deformed into alignment using simple
coordinate transformations, as illustrated in Fig. 2. In the
computer vision literature, Fischler and Elschlager [15]
operationalized such an idea by means of energy minimization
in a mass-spring model. Grenander et al. [21]
developed  these  ideas  in  a  probabilistic  setting.  Yuille  [61] 
developed  another  variant  of  the  deformable  template 
concept  by  means  of  fitting  hand-crafted  parametrized 
models,  e.g.,  for  eyes,  in  the  image  domain  using  gradient 
descent.  Another  well-known  computational  approach  in 
this  vein  was  developed  by  Lades  et  al.  [31]  using  elastic 
graph  matching. 

Our  primary  contribution  in  this  work  is  a  robust  and 
simple  algorithm  for  finding  correspondences  between 
shapes.  Shapes  are  represented  by  a  set  of  points  sampled 
from  the  shape  contours  (typically  100  or  so  pixel  locations 
sampled  from  the  output  of  an  edge  detector  are  used). 
There  is  nothing  special  about  the  points.  They  are  not 
required  to  be  landmarks  or  curvature  extrema,  etc.;  as  we 
use  more  samples,  we  obtain  better  approximations  to  the 
underlying  shape.  We  introduce  a  shape  descriptor,  the 
shape  context,  to  describe  the  coarse  distribution  of  the  rest  of 
the  shape  with  respect  to  a  given  point  on  the  shape. 
Finding  correspondences  between  two  shapes  is  then 
equivalent  to  finding  for  each  sample  point  on  one  shape 
the  sample  point  on  the  other  shape  that  has  the  most 
similar shape context. Maximizing similarities and enforcing
uniqueness naturally leads to a setup as a bipartite
graph matching (equivalently, optimal assignment) problem.
As desired, we can incorporate other sources of
matching  information  readily,  e.g.,  similarity  of  local 
appearance  at  corresponding  points. 

Given  the  correspondences  at  sample  points,  we  extend 
the  correspondence  to  the  complete  shape  by  estimating  an 
aligning transformation that maps one shape onto the other.

Fig. 1. Examples of two handwritten digits. In terms of pixel-to-pixel
comparisons,  these  two  images  are  quite  different,  but  to  the  human 
observer,  the  shapes  appear  to  be  similar. 

A  classic  illustration  of  this  idea  is  provided  in  Fig.  2.  The 
transformations  can  be  picked  from  any  of  a  number  of 
families — we  have  used  Euclidean,  affine,  and  regularized 
thin  plate  splines  in  various  applications.  Aligning  shapes 
enables  us  to  define  a  simple,  yet  general,  measure  of  shape 
similarity.  The  dissimilarity  between  the  two  shapes  can 
now  be  computed  as  a  sum  of  matching  errors  between 
corresponding  points,  together  with  a  term  measuring  the 
magnitude  of  the  aligning  transform. 

Given  such  a  dissimilarity  measure,  we  can  use 
nearest-neighbor techniques for object recognition. Philosophically,
nearest-neighbor techniques can be related to
prototype-based  recognition  as  developed  by  Rosch  [47] 
and  Rosch  et  al.  [48].  They  have  the  advantage  that  a 
vector  space  structure  is  not  required — only  a  pairwise 
dissimilarity  measure. 
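
As a minimal sketch of why only a pairwise dissimilarity is needed, the following Python snippet labels a query by its nearest stored prototype. The function name `nearest_prototype_label` and the toy string dissimilarity are illustrative assumptions and not part of the method described here; in practice the dissimilarity would be the shape distance developed in the later sections.

```python
def nearest_prototype_label(query, prototypes, labels, dissimilarity):
    """Return the label of the prototype with the smallest dissimilarity to the query."""
    scores = [dissimilarity(query, p) for p in prototypes]
    best = min(range(len(scores)), key=scores.__getitem__)
    return labels[best]


def toy_dissimilarity(a, b):
    # Hypothetical stand-in: size of the symmetric difference of the character sets.
    # Note that no vector-space structure is used, only a pairwise score.
    return len(set(a) ^ set(b))


prototypes = ["circle", "square"]
labels = ["round", "angular"]
predicted = nearest_prototype_label("circles", prototypes, labels, toy_dissimilarity)
```

Any function returning a nonnegative score for a pair of objects could be passed in place of `toy_dissimilarity`.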

We  demonstrate  object  recognition  in  a  wide  variety  of 
settings.  We  deal  with  2D  objects,  e.g.,  the  MNIST  data  set 
of  handwritten  digits  (Fig.  8),  silhouettes  (Figs.  11  and  13) 
and  trademarks  (Fig.  12),  as  well  as  3D  objects  from  the 
Columbia  COIL  data  set,  modeled  using  multiple  views 
(Fig.  10).  These  are  widely  used  benchmarks  and  our 
approach  turns  out  to  be  the  leading  performer  on  all  the 
problems  for  which  there  is  comparative  data. 

We  have  also  developed  a  technique  for  selecting  the 
number  of  stored  views  for  each  object  category  based  on  its 
visual  complexity.  As  an  illustration,  we  show  that  for  the 
3D  objects  in  the  COIL-20  data  set,  one  can  obtain  as  low  as 
2.5  percent  misclassification  error  using  only  an  average  of 
four  views  per  object  (see  Figs.  9  and  10). 

The  structure  of  this  paper  is  as  follows:  We  discuss 
related  work  in  Section  2.  In  Section  3,  we  describe  our 
shape-matching  method  in  detail.  Our  transformation 
model  is  presented  in  Section  4.  We  then  discuss  the 
problem  of  measuring  shape  similarity  in  Section  5  and 
demonstrate our proposed measure on a variety of
databases including handwritten digits and pictures of
3D objects in Section 6. We conclude in Section 7.

2  Prior  Work  on  Shape  Matching 

Mathematicians  typically  define  shape  as  an  equivalence 
class  under  a  group  of  transformations.  This  definition  is 
incomplete  in  the  context  of  visual  analysis.  This  only  tells 
us  when  two  shapes  are  exactly  the  same.  We  need  more 
than  that  for  a  theory  of  shape  similarity  or  shape  distance. 
The  statistician's  definition  of  shape,  e.g.,  Bookstein  [6]  or 
Kendall  [29],  addresses  the  problem  of  shape  distance,  but 
assumes  that  correspondences  are  known.  Other  statistical 
approaches to shape comparison do not require
correspondences — e.g., one could compare feature vectors containing
descriptors  such  as  area  or  moments — but  such  techniques 
often  discard  detailed  shape  information  in  the  process. 
Shape  similarity  has  also  been  studied  in  the  psychology 
literature,  an  early  example  being  Goldmeier  [20]. 

An  extensive  survey  of  shape  matching  in  computer 
vision  can  be  found  in  [58],  [22].  Broadly  speaking,  there  are 
two  approaches:  1)  feature-based,  which  involve  the  use  of 
spatial  arrangements  of  extracted  features  such  as  edge 
elements  or  junctions,  and  2)  brightness-based,  which  make 
more  direct  use  of  pixel  brightnesses. 

2.1  Feature-Based  Methods 

A  great  deal  of  research  on  shape  similarity  has  been  done 
using  the  boundaries  of  silhouette  images.  Since  silhouettes 
do  not  have  holes  or  internal  markings,  the  associated 
boundaries  are  conveniently  represented  by  a  single-closed 
curve  which  can  be  parametrized  by  arclength.  Early  work 
used  Fourier  descriptors,  e.g.,  [62],  [43].  Blum's  medial  axis 
transform  has  led  to  attempts  to  capture  the  part  structure  of 
the  shape  in  the  graph  structure  of  the  skeleton  by  Kimia, 
Zucker  and  collaborators,  e.g.,  Sharvit  et  al.  [53].  The 
1D nature of silhouette curves leads naturally to dynamic
programming  approaches  for  matching,  e.g.,  [17],  which  uses 
the  edit  distance  between  curves.  This  algorithm  is  fast  and 
invariant  to  several  kinds  of  transformation  including  some 
articulation  and  occlusion.  A  comprehensive  comparison  of 
different  shape  descriptors  for  comparing  silhouettes  was 
done  as  part  of  the  MPEG-7  standard  activity  [33],  with  the 
leading  approaches  being  those  due  to  Latecki  et  al.  [33]  and 
Mokhtarian  et  al.  [39]. 


Fig. 2. Example of coordinate transformations relating two fish, from D'Arcy Thompson's On Growth and Form [55]. Thompson observed that similar
biological forms could be related by means of simple mathematical transformations between homologous (i.e., corresponding) features. Examples of
homologous features include center of eye, tip of dorsal fin, etc.

Silhouettes are fundamentally limited as shape descriptors
for general objects; they ignore internal contours and
are difficult to extract from real images. More promising are
approaches that treat the shape as a set of points in the
2D image. Extracting these from an image is less of a
problem — e.g., one can just use an edge detector. Huttenlocher
et al. developed methods in this category based on
the Hausdorff distance [23]; this can be extended to deal
with partial matching and clutter. A drawback for our
purposes is that the method does not return correspondences.
Methods based on Distance Transforms, such as
[16], are similar in spirit and behavior in practice.

The  work  of  Sclaroff  and  Pentland  [50]  is  representative 
of  the  eigenvector-  or  modal-matching  based  approaches; 
see also [52], [51], [57]. In this approach, sample points in
the  image  are  cast  into  a  finite  element  spring-mass  model 
and  correspondences  are  found  by  comparing  modes  of 
vibration.  Most  closely  related  to  our  approach  is  the  work 
of  Gold  et  al.  [19]  and  Chui  and  Rangarajan  [9],  which  is 
discussed  in  Section  3.4. 

There  have  been  several  approaches  to  shape  recognition 
based  on  spatial  configurations  of  a  small  number  of 
keypoints  or  landmarks.  In  geometric  hashing  [32],  these 
configurations  are  used  to  vote  for  a  model  without 
explicitly  solving  for  correspondences.  Amit  et  al.  [1]  train 
decision  trees  for  recognition  by  learning  discriminative 
spatial  configurations  of  keypoints.  Leung  et  al.  [35], 
Schmid  and  Mohr  [49],  and  Lowe  [36]  additionally  use 
gray-level  information  at  the  keypoints  to  provide  greater 
discriminative  power.  It  should  be  noted  that  not  all  objects 
have  distinguished  key  points  (think  of  a  circle  for 
instance),  and  using  key  points  alone  sacrifices  the  shape 
information  available  in  smooth  portions  of  object  contours. 

2.2  Brightness-Based  Methods 

Brightness-based  (or  appearance-based)  methods  offer  a 
complementary  view  to  feature-based  methods.  Instead  of 
focusing  on  the  shape  of  the  occluding  contour  or  other 
extracted  features,  these  approaches  make  direct  use  of  the 
gray  values  within  the  visible  portion  of  the  object.  One  can 
use  brightness  information  in  one  of  two  frameworks. 

In  the  first  category,  we  have  the  methods  that  explicitly 
find correspondences/alignment using grayscale values.
Yuille  [61]  presents  a  very  flexible  approach  in  that 
invariance  to  certain  kinds  of  transformations  can  be  built 
into  the  measure  of  model  similarity,  but  it  suffers  from  the 
need  for  human-designed  templates  and  the  sensitivity  to 
initialization  when  searching  via  gradient  descent.  Lades  et 
al.  [31]  use  elastic  graph  matching,  an  approach  that 
involves  both  geometry  and  photometric  features  in  the 
form  of  local  descriptors  based  on  Gaussian  derivative  jets. 
Vetter  et  al.  [59]  and  Cootes  et  al.  [10]  compare  brightness 
values  but  first  attempt  to  warp  the  images  onto  one 
another  using  a  dense  correspondence  field. 

The  second  category  includes  those  methods  that  build 
classifiers  without  explicitly  finding  correspondences.  In 
such  approaches,  one  relies  on  a  learning  algorithm  having 
enough  examples  to  acquire  the  appropriate  invariances.  In 
the  area  of  face  recognition,  good  results  were  obtained  using 
principal components analysis (PCA) [54], [56], particularly
when used in a probabilistic framework [38]. Murase and
Nayar applied these ideas to 3D object recognition [40].
Several authors have applied discriminative classification
methods in the appearance-based shape matching framework.
Some examples are the LeNet classifier [34], a
convolutional neural network for handwritten digit recognition,
and the Support Vector Machine (SVM)-based methods
of [41] (for discriminating between templates of pedestrians
based on 2D wavelet coefficients) and [11], [7] (for handwritten
digit recognition). The MNIST database of handwritten
digits is a particularly important data set as many
different  pattern  recognition  algorithms  have  been  tested  on 
it.  We  will  show  our  results  on  MNIST  in  Section  6.1. 

3  Matching  with  Shape  Contexts 

In  our  approach,  we  treat  an  object  as  a  (possibly  infinite) 
point  set  and  we  assume  that  the  shape  of  an  object  is 
essentially  captured  by  a  finite  subset  of  its  points.  More 
practically,  a  shape  is  represented  by  a  discrete  set  of  points 
sampled  from  the  internal  or  external  contours  on  the 
object.  These  can  be  obtained  as  locations  of  edge  pixels  as 
found by an edge detector, giving us a set $\mathcal{P} = \{p_1, \ldots, p_n\}$,
$p_i \in \mathbb{R}^2$, of $n$ points. They need not, and typically will not,
correspond to key-points such as maxima of curvature or
inflection points. We prefer to sample the shape with
roughly uniform spacing, though this is also not critical.¹
Figs. 3a and 3b show sample points for two shapes.
Assuming  contours  are  piecewise  smooth,  we  can  obtain 
as  good  an  approximation  to  the  underlying  continuous 
shapes  as  desired  by  picking  n  to  be  sufficiently  large. 
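
One way to obtain roughly uniformly spaced samples from a list of edge pixels is sketched below. It uses greedy farthest-point subsampling, which is an assumption made for illustration and not necessarily the sampling procedure referred to in Appendix B; the helper name `sample_points` and the synthetic circle standing in for edge-detector output are likewise hypothetical.

```python
import numpy as np


def sample_points(edge_pixels, n, seed=0):
    """Pick n roughly evenly spread points from an (N, 2) array of edge coordinates."""
    pts = np.asarray(edge_pixels, dtype=float)
    if n >= len(pts):
        return pts.copy()
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(pts)))]  # arbitrary starting pixel
    # d[j] = distance from pixel j to its nearest already-chosen sample
    d = np.linalg.norm(pts - pts[chosen[0]], axis=1)
    while len(chosen) < n:
        nxt = int(np.argmax(d))  # the pixel farthest from all current samples
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(pts - pts[nxt], axis=1))
    return pts[chosen]


# Synthetic stand-in for edge-detector output: 400 points on a circle.
theta = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
circle = np.column_stack([50 + 30 * np.cos(theta), 50 + 30 * np.sin(theta)])
samples = sample_points(circle, 100)
```

Because the samples are not tied to landmarks, increasing `n` simply refines the approximation of the contour.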

3.1  Shape  Context 

For each point $p_i$ on the first shape, we want to find the
"best" matching point $q_j$ on the second shape. This is a
correspondence  problem  similar  to  that  in  stereopsis. 
Experience  there  suggests  that  matching  is  easier  if  one 
uses  a  rich  local  descriptor,  e.g.,  a  gray-scale  window  or  a 
vector  of  filter  outputs  [27],  instead  of  just  the  brightness  at 
a  single  pixel  or  edge  location.  Rich  descriptors  reduce  the 
ambiguity  in  matching. 

As  a  key  contribution,  we  propose  a  novel  descriptor,  the 
shape  context,  that  could  play  such  a  role  in  shape  matching. 
Consider  the  set  of  vectors  originating  from  a  point  to  all 
other  sample  points  on  a  shape.  These  vectors  express  the 
configuration  of  the  entire  shape  relative  to  the  reference 
point. Obviously, this set of $n - 1$ vectors is a rich
description,  since  as  n  gets  large,  the  representation  of  the 
shape  becomes  exact. 

The  full  set  of  vectors  as  a  shape  descriptor  is  much  too 
detailed  since  shapes  and  their  sampled  representation  may 
vary  from  one  instance  to  another  in  a  category.  We  identify 
the  distribution  over  relative  positions  as  a  more  robust  and 
compact, yet highly discriminative descriptor. For a point $p_i$
on the shape, we compute a coarse histogram $h_i$ of the
relative coordinates of the remaining $n - 1$ points,

$$h_i(k) = \#\{q \neq p_i : (q - p_i) \in \mathrm{bin}(k)\}. \qquad (1)$$

1. Sampling considerations are discussed in Appendix B.

Fig. 3. Shape context computation and matching. (a) and (b) Sampled edge points of two shapes. (c) Diagram of log-polar histogram bins used in
computing the shape contexts. We use five bins for $\log r$ and 12 bins for $\theta$. (d), (e), and (f) Example shape contexts for reference samples marked by
∘, ⋄, ◁ in (a) and (b). Each shape context is a log-polar histogram of the coordinates of the rest of the point set measured using the reference point as
the origin. (Dark=large value.) Note the visual similarity of the shape contexts for ∘ and ⋄, which were computed for relatively similar points on the two
shapes. By contrast, the shape context for ◁ is quite different. (g) Correspondences found using bipartite matching, with costs defined by the $\chi^2$
distance between histograms.


This histogram is defined to be the shape context of $p_i$. We
use bins that are uniform in log-polar² space, making the
descriptor  more  sensitive  to  positions  of  nearby  sample 
points  than  to  those  of  points  farther  away.  An  example  is 
shown  in  Fig.  3c. 
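
The histogram of (1) with log-polar bins can be sketched as follows. The five radial and 12 angular bins follow Fig. 3c; the radial bin edges (`r_inner`, `r_outer`), the clamping of far-away points into the outermost ring, and the normalization by the mean inter-point distance are assumptions of this sketch, and the function name `shape_contexts` is hypothetical.

```python
import numpy as np


def shape_contexts(points, n_r=5, n_theta=12, r_inner=0.125, r_outer=2.0):
    """Return an (n, n_r * n_theta) array; row i is the histogram h_i of Eq. (1)."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    diff = pts[None, :, :] - pts[:, None, :]  # diff[i, j] = p_j - p_i
    dist = np.linalg.norm(diff, axis=2)
    angle = np.arctan2(diff[..., 1], diff[..., 0]) % (2.0 * np.pi)
    off_diag = ~np.eye(n, dtype=bool)  # excludes q == p_i

    # Distances in units of the mean inter-point distance (assumed normalization).
    r = dist / dist[off_diag].mean()

    # Upper edges of the radial bins, uniform in log r (assumed values).
    r_edges = np.logspace(np.log10(r_inner), np.log10(r_outer), n_r)
    # Points beyond r_outer are clamped into the outermost ring (assumption).
    r_bin = np.minimum(np.searchsorted(r_edges, r, side="left"), n_r - 1)
    t_bin = np.minimum((angle / (2.0 * np.pi) * n_theta).astype(int), n_theta - 1)

    hist = np.zeros((n, n_r, n_theta), dtype=int)
    i_idx, j_idx = np.nonzero(off_diag)
    np.add.at(hist, (i_idx, r_bin[i_idx, j_idx], t_bin[i_idx, j_idx]), 1)
    return hist.reshape(n, n_r * n_theta)


rng = np.random.default_rng(0)
pts = rng.random((50, 2))
sc = shape_contexts(pts)  # each row counts the other 49 points
```

With clamping, every one of the remaining $n - 1$ points falls in exactly one bin, so each row sums to $n - 1$.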

Consider a point $p_i$ on the first shape and a point $q_j$ on
the second shape. Let $C_{ij} = C(p_i, q_j)$ denote the cost of
matching these two points. As shape contexts are
distributions represented as histograms, it is natural to
use the $\chi^2$ test statistic:

$$C_{ij} \equiv C(p_i, q_j) = \frac{1}{2} \sum_{k=1}^{K} \frac{[h_i(k) - h_j(k)]^2}{h_i(k) + h_j(k)},$$

where $h_i(k)$ and $h_j(k)$ denote the $K$-bin normalized
histogram at $p_i$ and $q_j$, respectively.³
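
A minimal sketch of the $\chi^2$ cost between every pair of histograms follows. The function name `chi2_cost` is hypothetical, and treating bins that are empty in both histograms as contributing zero is an assumption made to avoid division by zero.

```python
import numpy as np


def chi2_cost(h1, h2):
    """Cost matrix C[i, j] = 0.5 * sum_k (h1_i(k) - h2_j(k))^2 / (h1_i(k) + h2_j(k))."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    # Normalize each histogram to unit mass (assumes no histogram is all zeros).
    h1 = h1 / h1.sum(axis=1, keepdims=True)
    h2 = h2 / h2.sum(axis=1, keepdims=True)
    num = (h1[:, None, :] - h2[None, :, :]) ** 2
    den = h1[:, None, :] + h2[None, :, :]
    # Bins empty in both histograms contribute zero (assumed convention for 0/0).
    terms = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return 0.5 * terms.sum(axis=2)


C_same = chi2_cost([[1, 0], [0, 1]], [[1, 0], [0, 1]])
C_mixed = chi2_cost([[2, 2, 0]], [[1, 1, 2]])
```

With normalized histograms the cost lies between 0 (identical distributions) and 1 (disjoint support), which makes a fixed outlier threshold easier to choose.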

The cost $C_{ij}$ for matching points can include an
additional term based on the local appearance similarity at
points $p_i$ and $q_j$. This is particularly useful when we are
comparing shapes derived from gray-level images instead
of line drawings. For example, one can add a cost based on
normalized correlation scores between small gray-scale
patches centered at $p_i$ and $q_j$, distances between vectors of
filter outputs at $p_i$ and $q_j$, tangent orientation difference
between $p_i$ and $q_j$, and so on. The choice of this appearance
similarity  term  is  application  dependent,  and  is  driven  by 
the  necessary  invariance  and  robustness  requirements,  e.g., 
varying  lighting  conditions  make  reliance  on  gray-scale 
brightness  values  risky. 


2. This choice corresponds to a linearly increasing positional uncertainty
with distance from $p_i$, a reasonable result if the transformation between the
shapes around $p_i$ can be locally approximated as affine.

3. Alternatives include Bickel's generalization of the Kolmogorov-Smirnov
test for 2D distributions [4], which does not require binning.


3.2  Bipartite  Graph  Matching 

Given the set of costs $C_{ij}$ between all pairs of points $p_i$ on
the first shape and $q_j$ on the second shape, we want to
minimize the total cost of matching,

$$H(\pi) = \sum_i C(p_i, q_{\pi(i)}) \qquad (2)$$

subject to the constraint that the matching be one-to-one, i.e., $\pi$
is a permutation. This is an instance of the square assignment
(or weighted bipartite matching) problem, which can be
solved in $O(N^3)$ time using the Hungarian method [42]. In our
experiments, we use the more efficient algorithm of [28]. The
input to the assignment problem is a square cost matrix with
entries $C_{ij}$. The result is a permutation $\pi(i)$ such that (2) is
minimized.
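
For illustration, the assignment problem can be solved with an off-the-shelf solver. The sketch below uses SciPy's `linear_sum_assignment` as a stand-in; it is not claimed to be the algorithm of [28], and the 3 x 3 cost matrix is made up.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical cost matrix: C[i, j] is the cost of matching p_i to q_j.
C = np.array([
    [0.90, 0.10, 0.50],
    [0.20, 0.80, 0.70],
    [0.60, 0.40, 0.05],
])

rows, cols = linear_sum_assignment(C)  # minimizes the sum of C[rows, cols]
pi = cols                              # pi[i] is the index of the q matched to p_i
total_cost = C[rows, cols].sum()       # value of H(pi) in Eq. (2)
```

Here the minimizing permutation matches $p_1$ to $q_2$, $p_2$ to $q_1$, and $p_3$ to $q_3$, for a total cost of 0.35.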

In order to have robust handling of outliers, one can add
"dummy" nodes to each point set with a constant matching
cost of $\epsilon_d$. In this case, a point will be matched to a
"dummy" whenever there is no real match available at
smaller cost than $\epsilon_d$. Thus, $\epsilon_d$ can be regarded as a threshold
parameter  for  outlier  detection.  Similarly,  when  the  number 
of  sample  points  on  two  shapes  is  not  equal,  the  cost  matrix 
can  be  made  square  by  adding  dummy  nodes  to  the  smaller 
point  set. 
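
The padding step can be sketched as below. The helper name `pad_with_dummies`, the parameter `n_extra` (additional dummy nodes beyond those needed to make the matrix square), and the example values are assumptions made for illustration.

```python
import numpy as np


def pad_with_dummies(C, eps_d, n_extra=0):
    """Embed an n x m cost matrix in a square matrix whose extra entries cost eps_d."""
    C = np.asarray(C, dtype=float)
    n, m = C.shape
    size = max(n, m) + n_extra
    padded = np.full((size, size), float(eps_d))  # every dummy match costs eps_d
    padded[:n, :m] = C                            # real-to-real costs are unchanged
    return padded


# Two points on the first shape, three on the second, plus one extra dummy per side.
C = np.array([[0.10, 0.90, 0.40],
              [0.80, 0.30, 0.70]])
padded = pad_with_dummies(C, eps_d=0.25, n_extra=1)
```

The padded matrix can be passed to any square assignment solver; a real point assigned to a row or column index beyond the original matrix is then read as an outlier.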

3.3  Invariance  and  Robustness 

A  matching  approach  should  be  1)  invariant  under  scaling 
and  translation,  and  2)  robust  under  small  geometrical 
distortions,  occlusion  and  presence  of  outliers.  In  certain 
applications,  one  may  want  complete  invariance  under 
rotation, or perhaps even the full group of affine transformations.
We now evaluate shape context matching by these
criteria.

Invariance to translation is intrinsic to the shape context
definition since all measurements are taken with respect to
points on the object. To achieve scale invariance we
normalize all radial distances by the mean distance $\alpha$
between the $n^2$ point pairs in the shape.
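
The effect of this normalization can be checked numerically. In the sketch below, the helper name `mean_pairwise_distance` is hypothetical, and the mean is taken over all $n^2$ ordered pairs, including the zero self-distances, which is one reading of the definition above.

```python
import numpy as np


def mean_pairwise_distance(points):
    """Mean of the distances over all n^2 ordered point pairs (self-pairs included)."""
    pts = np.asarray(points, dtype=float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    return d.mean(), d


rng = np.random.default_rng(1)
shape = rng.random((30, 2))

alpha, d = mean_pairwise_distance(shape)
alpha_big, d_big = mean_pairwise_distance(3.0 * shape + 5.0)  # scaled and shifted copy

# After dividing by alpha, the two distance tables agree.
normalized_match = np.allclose(d / alpha, d_big / alpha_big)
```

Since the radial bins only see `d / alpha`, a uniformly rescaled shape produces the same radial bin memberships.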

Since  shape  contexts  are  extremely  rich  descriptors,  they 
are  inherently  insensitive  to  small  perturbations  of  parts  of 
the  shape.  While  we  have  no  theoretical  guarantees  here, 
robustness  to  small  nonlinear  transformations,  occlusions 
and  presence  of  outliers  is  evaluated  experimentally  in 
Section  4.2. 

In  the  shape  context  framework,  we  can  provide  for 
complete  rotation  invariance,  if  this  is  desirable  for  an 
application.  Instead  of  using  the  absolute  frame  for 
computing  the  shape  context  at  each  point,  one  can  use  a 
relative  frame,  based  on  treating  the  tangent  vector  at  each 
point as the positive $x$-axis. In this way, the reference frame
turns with the tangent angle, and the result is a completely
rotation invariant descriptor. In Appendix A, we demonstrate
this experimentally. It should be emphasized though
that,  in  many  applications,  complete  invariance  impedes 
recognition  performance,  e.g.,  when  distinguishing  6  from  9 
rotation  invariance  would  be  completely  inappropriate. 
Another  drawback  is  that  many  points  will  not  have  well- 
defined or reliable tangents. Moreover, many local appearance
features lose their discriminative power if they are not
measured  in  the  same  coordinate  system. 
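
The relative-frame idea can be sketched as follows: angles to the other points are measured from the local tangent direction instead of the image $x$-axis. The tangent estimate by central differences assumes the points are ordered along a closed contour, which is an assumption of this sketch; the helper name `relative_angles` is hypothetical.

```python
import numpy as np


def relative_angles(points):
    """rel[i, j] = angle of (p_j - p_i) measured from the tangent direction at p_i."""
    pts = np.asarray(points, dtype=float)
    # Central-difference tangent; assumes points are ordered along a closed contour.
    tangent = np.roll(pts, -1, axis=0) - np.roll(pts, 1, axis=0)
    tangent_angle = np.arctan2(tangent[:, 1], tangent[:, 0])
    diff = pts[None, :, :] - pts[:, None, :]
    absolute = np.arctan2(diff[..., 1], diff[..., 0])
    return (absolute - tangent_angle[:, None]) % (2.0 * np.pi)


# Ellipse sampled in order, and a copy rotated by 40 degrees.
t = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
ellipse = np.column_stack([3.0 * np.cos(t), np.sin(t)])
phi = np.deg2rad(40.0)
R = np.array([[np.cos(phi), -np.sin(phi)], [np.sin(phi), np.cos(phi)]])

base = relative_angles(ellipse)
rotated = relative_angles(ellipse @ R.T)

# Compare angles modulo 2*pi, ignoring the undefined i == j entries.
off_diag = ~np.eye(len(t), dtype=bool)
delta = np.angle(np.exp(1j * (rotated - base)))
max_change = np.abs(delta[off_diag]).max()
```

The angular bin memberships derived from `relative_angles` are then unchanged by a rotation of the whole shape, up to rounding error.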

Additional  robustness  to  outliers  can  be  obtained  by 
excluding  the  estimated  outliers  from  the  shape  context 
computation.  More  specifically,  consider  a  set  of  points  that 
have  been  labeled  as  outliers  on  a  given  iteration.  We 
render  these  points  "invisible"  by  not  allowing  them  to 
contribute  to  any  histogram.  However,  we  still  assign  them 
shape  contexts,  taking  into  account  only  the  surrounding 
inlier  points,  so  that  at  a  later  iteration  they  have  a  chance  of 
reemerging  as  an  inlier. 

3.4  Related  Work 

The most comprehensive body of work on shape correspondence
in this general setting is the work of Gold et al.
[19] and Chui and Rangarajan [9]. They developed an
iterative optimization algorithm to determine point correspondences
and underlying image transformations jointly,
where  typically  some  generic  transformation  class  is 
assumed,  e.g.,  affine  or  thin  plate  splines.  The  cost  function 
that  is  being  minimized  is  the  sum  of  Euclidean  distances 
between  a  point  on  the  first  shape  and  the  transformed 
second  shape.  This  sets  up  a  chicken-and-egg  problem:  The 
distances  make  sense  only  when  there  is  at  least  a  rough 
alignment  of  shape.  Joint  estimation  of  correspondences 
and shape transformation leads to a difficult, highly nonconvex
optimization problem, which is solved using
deterministic  annealing  [19].  The  shape  context  is  a  very 
discriminative  point  descriptor,  facilitating  easy  and  robust 
correspondence  recovery  by  incorporating  global  shape 
information  into  a  local  descriptor. 

As  far  as  we  are  aware,  the  shape  context  descriptor  and 
its  use  for  matching  2D  shapes  is  novel.  The  most  closely 
related  idea  in  past  work  is  that  due  to  Johnson  and  Hebert 
[26]  in  their  work  on  range  images.  They  introduced  a 
representation for matching dense clouds of oriented
3D points called the "spin image." A spin image is a
2D  histogram  formed  by  spinning  a  plane  around  a  normal 
vector  on  the  surface  of  the  object  and  counting  the  points 
that  fall  inside  bins  in  the  plane.  As  the  size  of  this  plane  is 
relatively  small,  the  resulting  signature  is  not  as  informative 
as a shape context for purposes of recovering correspondences.
This characteristic, however, might have the trade-off
of additional robustness to occlusion. In another related
work,  Carlsson  [8]  has  exploited  the  concept  of  order 
structure  for  characterizing  local  shape  configurations.  In 
this  work,  the  relationships  between  points  and  tangent 
lines  in  a  shape  are  used  for  recovering  correspondences. 


4  Modeling  Transformations 

Given  a  finite  set  of  correspondences  between  points  on  two 
shapes,  one  can  proceed  to  estimate  a  plane  transformation 
$T : \mathbb{R}^2 \to \mathbb{R}^2$ that may be used to map arbitrary points from
one shape to the other. This idea is illustrated by the
warped gridlines in Fig. 2, wherein the specified correspondences
consisted of a small number of landmark points
such  as  the  centers  of  the  eyes,  the  tips  of  the  dorsal  fins, 
etc.,  and  T  extends  the  correspondences  to  arbitrary  points. 

We  need  to  choose  T  from  a  suitable  family  of 
transformations.  A  standard  choice  is  the  affine  model,  i.e., 


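$$T(x) = Ax + o \qquad (3)$$

with some matrix $A$ and a translational offset vector $o$
parameterizing the set of all allowed transformations. Then,
the least squares solution $\hat{T} = (\hat{A}, \hat{o})$ is obtained by

$$\hat{o} = \frac{1}{n} \sum_{i=1}^{n} \left(p_i - q_{\pi(i)}\right), \qquad (4)$$

$$\hat{A} = (Q^{+} P)^{t}, \qquad (5)$$

where $P$ and $Q$ contain the homogeneous coordinates of $\mathcal{P}$
and $\mathcal{Q}$, respectively, i.e.,

$$P = \begin{pmatrix} 1 & p_{11} & p_{12} \\ \vdots & \vdots & \vdots \\ 1 & p_{n1} & p_{n2} \end{pmatrix}. \qquad (6)$$

Here, $Q^{+}$ denotes the pseudoinverse of $Q$.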
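
As a generic illustration of fitting the affine model (3) to matched points, the sketch below solves the least-squares problem on homogeneous coordinates with `numpy.linalg.lstsq`. It estimates $A$ and $o$ jointly, so it is a standard formulation and not necessarily identical to the closed-form expressions above; the function name `fit_affine` and the synthetic data are assumptions.

```python
import numpy as np


def fit_affine(src, dst):
    """Least-squares A (2x2) and o (2,) such that dst_i is approximately A @ src_i + o."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    X = np.column_stack([np.ones(len(src)), src])  # rows (1, x, y), as in Eq. (6)
    M, *_ = np.linalg.lstsq(X, dst, rcond=None)    # M has shape (3, 2)
    o = M[0]       # translational offset
    A = M[1:].T    # linear part
    return A, o


# Synthetic correspondences generated from a known transform.
rng = np.random.default_rng(2)
src = rng.random((20, 2))
A_true = np.array([[1.2, -0.3], [0.4, 0.9]])
o_true = np.array([0.5, -1.0])
dst = src @ A_true.T + o_true

A_hat, o_hat = fit_affine(src, dst)
```

`lstsq` applies the pseudoinverse of the homogeneous coordinate matrix internally, so noise-free affine data is recovered exactly up to rounding.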

In  this  work,  we  mostly  use  the  thin  plate  spline  (TPS) 
model  [14],  [37],  which  is  commonly  used  for  representing 
flexible  coordinate  transformations.  Bookstein  [6]  found  it 
to  be  highly  effective  for  modeling  changes  in  biological 
forms. Powell applied the TPS model to recover transformations
between curves [44]. The thin plate spline is the
2D  generalization  of  the  cubic  spline.  In  its  regularized 
form,  which  is  discussed  below,  the  TPS  model  includes  the 
affine  model  as  a  special  case.  We  will  now  provide  some 
background  information  on  the  TPS  model. 

We start with the 1D interpolation problem. Let $v_i$ denote
the target function values at corresponding locations $p_i =
(x_i, y_i)$ in the plane, with $i = 1, 2, \ldots, n$. In particular, we
will set $v_i$ equal to $x_i'$ and $y_i'$ in turn to obtain one continuous
transformation for each coordinate. We assume that the
locations $(x_i, y_i)$ are all different and are not collinear. The
TPS interpolant $f(x, y)$ minimizes the bending energy

$$I_f = \iint_{\mathbb{R}^2} \left(\frac{\partial^2 f}{\partial x^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x\,\partial y}\right)^2 + \left(\frac{\partial^2 f}{\partial y^2}\right)^2 \, dx\, dy$$

and has the form:

$$f(x, y) = a_1 + a_x x + a_y y + \sum_{i=1}^{n} w_i U(\|(x_i, y_i) - (x, y)\|),$$

where the kernel function $U(r)$ is defined by $U(r) = r^2 \log r^2$
and $U(0) = 0$ as usual. In order for $f(x, y)$ to have square
integrable second derivatives, we require that

$$\sum_{i=1}^{n} w_i = 0 \quad \text{and} \quad \sum_{i=1}^{n} w_i x_i = \sum_{i=1}^{n} w_i y_i = 0. \qquad (7)$$

Together with the interpolation conditions, $f(x_i, y_i) = v_i$,
this yields a linear system for the TPS coefficients:

$$\begin{pmatrix} K & P \\ P^T & 0 \end{pmatrix} \begin{pmatrix} w \\ a \end{pmatrix} = \begin{pmatrix} v \\ 0 \end{pmatrix}, \qquad (8)$$

where $K_{ij} = U(\|(x_i, y_i) - (x_j, y_j)\|)$, the $i$th row of $P$ is
$(1, x_i, y_i)$, $w$ and $v$ are column vectors formed from $w_i$ and $v_i$,
respectively, and $a$ is the column vector with elements
$a_1, a_x, a_y$. We will denote the $(n + 3) \times (n + 3)$ matrix of this
system by $L$. As discussed, e.g., in [44], $L$ is nonsingular and
we can find the solution by inverting $L$. If we denote the
upper left $n \times n$ block of $L^{-1}$ by $A$, then it can be shown that

$$I_f \propto v^T A v = w^T K w. \qquad (9)$$
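
The linear system for the TPS coefficients can be assembled and solved directly, as in the sketch below. The function names `tps_kernel`, `fit_tps`, and `eval_tps` and the random test data are assumptions; `numpy.linalg.solve` is used in place of an explicit inverse of $L$.

```python
import numpy as np


def tps_kernel(r):
    """U(r) = r^2 log r^2, with U(0) = 0."""
    r2 = np.square(np.asarray(r, dtype=float))
    out = np.zeros_like(r2)
    nz = r2 > 0
    out[nz] = r2[nz] * np.log(r2[nz])
    return out


def fit_tps(ctrl, v):
    """Solve [[K, P], [P^T, 0]] [w; a] = [v; 0] for the TPS coefficients."""
    ctrl = np.asarray(ctrl, dtype=float)
    v = np.asarray(v, dtype=float)
    n = len(ctrl)
    K = tps_kernel(np.linalg.norm(ctrl[:, None] - ctrl[None, :], axis=2))
    P = np.column_stack([np.ones(n), ctrl])  # rows (1, x_i, y_i)
    L = np.zeros((n + 3, n + 3))
    L[:n, :n], L[:n, n:], L[n:, :n] = K, P, P.T
    sol = np.linalg.solve(L, np.concatenate([v, np.zeros(3)]))
    return sol[:n], sol[n:]  # w, and a = (a_1, a_x, a_y)


def eval_tps(ctrl, w, a, xy):
    """Evaluate f at the rows of xy."""
    ctrl = np.asarray(ctrl, dtype=float)
    xy = np.atleast_2d(np.asarray(xy, dtype=float))
    U = tps_kernel(np.linalg.norm(xy[:, None] - ctrl[None, :], axis=2))
    return a[0] + xy @ a[1:] + U @ w


rng = np.random.default_rng(3)
ctrl = rng.random((8, 2))   # distinct, non-collinear locations
v = rng.normal(size=8)      # arbitrary target values
w, a = fit_tps(ctrl, v)
f_at_ctrl = eval_tps(ctrl, w, a, ctrl)
```

The last three rows of the system enforce the side conditions (7), so the fitted `w` sums to zero and is orthogonal to both coordinate vectors.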

4.1  Regularization  and  Scaling  Behavior 

When there is noise in the specified values $v_i$, one may wish
to relax the exact interpolation requirement by means of
regularization. This is accomplished by minimizing

$$H[f] = \sum_{i=1}^{n} \left(v_i - f(x_i, y_i)\right)^2 + \lambda I_f. \qquad (10)$$

The regularization parameter $\lambda$, a positive scalar, controls
the amount of smoothing; the limiting case of $\lambda = 0$
reduces to exact interpolation. As demonstrated in [60],
[18], we can solve for the TPS coefficients in the
regularized case by replacing the matrix $K$ by $K + \lambda I$,
where $I$ is the $n \times n$ identity matrix. It is interesting to
note that the highly regularized TPS model degenerates to
the least-squares affine model.

To address the dependence of $\lambda$ on the data scale,
suppose $(x_i, y_i)$ and $(x_i', y_i')$ are replaced by $(\alpha x_i, \alpha y_i)$ and
$(\alpha x_i', \alpha y_i')$, respectively, for some positive constant $\alpha$. Then,
it can be shown that the parameters $w, a, I_f$ of the optimal
thin plate spline are unaffected if $\lambda$ is replaced by $\alpha^2 \lambda$. This
simple scaling behavior suggests a normalized definition of
the regularization parameter. Let $\alpha$ again represent the scale
of the point set as estimated by the mean edge length
between two points in the set. Then, we can define $\lambda$ in
terms of $\alpha$ and $\lambda_0$, a scale-independent regularization
parameter, via the simple relation $\lambda = \alpha^2 \lambda_0$.
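
The normalization can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the scale is taken to be the mean Euclidean distance over all pairs of points, which is one reading of "mean edge length between two points in the set".

```python
import math
from itertools import combinations

def mean_pairwise_distance(points):
    """Scale estimate alpha: mean distance over all point pairs (assumed reading)."""
    pairs = list(combinations(points, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def scaled_lambda(points, lambda0):
    """Regularization parameter lambda = alpha^2 * lambda0 for this point set."""
    alpha = mean_pairwise_distance(points)
    return alpha ** 2 * lambda0
```

Doubling all coordinates doubles the scale estimate and therefore multiplies the resulting regularization parameter by four, which matches the scaling behavior described above.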


We  use  two  separate  TPS  functions  to  model  a  coordinate 
transformation, 

$$T(x, y) = (f_x(x, y), f_y(x, y)) \tag{11}$$

which  yields  a  displacement  field  that  maps  any  position  in 
the  first  image  to  its  interpolated  location  in  the  second 
image. 

In many cases, the initial estimate of the correspondences
contains some errors which could degrade the quality of the
transformation estimate. The steps of recovering correspondences
and estimating transformations can be iterated to
overcome this problem. We usually use a fixed number of
iterations, typically three in large-scale experiments, but
more refined schemes are possible. However, our experiments
show that the performance of the algorithm does not
depend on these details. An example of the iterative
algorithm is illustrated in Fig. 4.

4.2  Empirical  Robustness  Evaluation 

In  order  to  study  the  robustness  of  our  proposed  method, 
we  performed  the  synthetic  point  set  matching  experiments 
described  in  [9].  The  experiments  are  broken  into  three 
parts  designed  to  measure  robustness  to  deformation,  noise, 
and outliers. (The latter two tests each include a "moderate"
amount  of  deformation.)  In  each  test,  we  subjected  the 
model  point  set  to  one  of  the  above  distortions  to  create  a 
"target"  point  set;  see  Fig.  5.  We  then  ran  our  algorithm  to 
find  the  best  warping  between  the  model  and  the  target. 
Finally,  the  performance  is  quantified  by  computing  the 
average  distance  between  the  coordinates  of  the  warped 
model  and  those  of  the  target.  The  results  are  shown  in 
Fig.  6.  In  the  most  challenging  part  of  the  test — the  outlier 
experiment — our  approach  shows  robustness  even  up  to  a 
level  of  100  percent  outlier-to-data  ratio. 

In  practice,  we  will  need  robustness  to  occlusion  and 
segmentation  errors  which  can  be  explored  only  in  the 
context  of  a  complete  recognition  system,  though  these 
experiments  provide  at  least  some  guidelines. 

4.3  Computational  Demands 

In our implementation on a regular Pentium III / 500 MHz
workstation, a single comparison including computation of
shape context for 100 sample points, set-up of the full
matching matrix, bipartite graph matching, computation of
the TPS coefficients, and image warping for three cycles
takes roughly 200 ms. The runtime is dominated by the
number  of  sample  points  for  each  shape,  with  most 
components  of  the  algorithm  exhibiting  between  quadratic 
and  cubic  scaling  behavior.  Using  a  sparse  representation 
throughout,  once  the  shapes  are  roughly  aligned,  the 
complexity  could  be  made  close  to  linear. 

5  Object  Recognition  and  Prototype 
Selection 

Given  a  measure  of  dissimilarity  between  shapes,  which  we 
will  make  precise  shortly,  we  can  proceed  to  apply  it  to  the 
task  of  object  recognition.  Our  approach  falls  into  the  category 
of  prototype-based  recognition.  In  this  framework,  pioneered 
by Rosch et al. [48], categories are represented by ideal

Fig. 4. Illustration of the matching process applied to the example of Fig. 1. Top row: 1st iteration. Bottom row: 5th iteration. Left column: estimated
correspondences shown relative to the transformed model, with tangent vectors shown. Middle column: estimated correspondences shown relative to
the untransformed model. Right column: result of transforming the model based on the current correspondences; this is the input to the next iteration.
The grid points illustrate the interpolated transformation over $\mathbb{R}^2$. Here, we have used a regularized TPS model with $\lambda_0 = 1$.


examples  rather  than  a  set  of  formal  logical  rules.  As  an 
example,  a  sparrow  is  a  likely  prototype  for  the  category  of 
birds; a less likely choice might be a penguin. The idea of
prototypes  allows  for  soft  category  membership,  meaning 
that  as  one  moves  farther  away  from  the  ideal  example  in 
some  suitably  defined  similarity  space,  one's  association  with 
that  prototype  falls  off.  When  one  is  sufficiently  far  away  from 
that  prototype,  the  distance  becomes  meaningless,  but  by 
then  one  is  most  likely  near  a  different  prototype.  As  an 
example,  one  can  talk  about  good  or  so-so  examples  of  the 
color  red,  but  when  the  color  becomes  sufficiently  different, 
the  level  of  dissimilarity  saturates  at  some  maximum  level 
rather  than  continuing  on  indefinitely. 

Prototype-based  recognition  translates  readily  into  the 
computational  framework  of  nearest-neighbor  methods 
using  multiple  stored  views.  Nearest-neighbor  classifiers 
have the property [46] that as the number of examples $n$ in
the training set goes to infinity, the 1-NN error converges to
a value $\le 2E^*$, where $E^*$ is the Bayes Risk (for $K$-NN, with
$K \to \infty$ and $K/n \to 0$, the error $\to E^*$). This is interesting
because  it  shows  that  the  humble  nearest-neighbor  classifier 
is  asymptotically  optimal,  a  property  not  possessed  by 
several  considerably  more  complicated  techniques.  Of 
course,  what  matters  in  practice  is  the  performance  for 
small  n,  and  this  gives  us  a  way  to  compare  different 
similarity/distance measures.
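
To make the role of the distance measure concrete, a nearest-neighbor classifier over stored prototypes can be sketched as follows. This is a generic illustration: the function name `knn_classify` and the representation of prototypes as `(shape, label)` pairs are assumptions, and any dissimilarity function can be passed in as `distance`.

```python
from collections import Counter

def knn_classify(query, prototypes, distance, k=1):
    """Label a query by majority vote among its k nearest stored prototypes.

    prototypes: list of (shape, label) pairs.
    distance:   callable returning the dissimilarity between two shapes.
    Ties in the vote are broken in favor of the label seen first
    among the nearest neighbors.
    """
    ranked = sorted(prototypes, key=lambda item: distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because the classifier touches the data only through `distance`, two similarity measures can be compared simply by swapping that argument and measuring the error for a fixed, small number of prototypes.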

5.1  Shape  Distance 

In  this  section,  we  make  precise  our  definition  of  shape 
distance  and  apply  it  to  several  practical  problems.  We  used 
a  regularized  TPS  transformation  model  and  three  iterations 
of  shape  context  matching  and  TPS  reestimation.  After 
matching, we estimated shape distances as the weighted
sum of three terms: shape context distance, image appearance
distance, and bending energy.

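We measure shape context distance between shapes $\mathcal{P}$
and $\mathcal{Q}$ as the symmetric sum of shape context matching
costs over best matching points, i.e.,

$$\mathcal{D}_{sc}(\mathcal{P}, \mathcal{Q}) = \frac{1}{n} \sum_{p \in \mathcal{P}} \min_{q \in \mathcal{Q}} C(p, T(q)) + \frac{1}{m} \sum_{q \in \mathcal{Q}} \min_{p \in \mathcal{P}} C(p, T(q)), \tag{12}$$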


Fig. 5. Testing data for empirical robustness evaluation, following Chui and Rangarajan [9]. The model point sets are shown in the first column.
Columns 2-4 show examples of target point sets for the deformation, noise, and outlier tests, respectively.



IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 24, NO. 4, APRIL 2002


Fig. 6. Comparison of our results (□) to Chui and Rangarajan (*) and iterated closest point (o) for the fish and Chinese character, respectively. The error
bars indicate the standard deviation of the error over 100 random trials. Here, we have used 5 iterations with $\lambda_0 = 1.0$. In the deformation and noise
tests no dummy nodes were added. In the outlier test, dummy nodes were added to the model point set such that the total number of nodes was equal
to that of the target. In this case, the value of $\epsilon_d$ does not affect the solution.


where $T(\cdot)$ denotes the estimated TPS shape transformation.
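
Given a precomputed matrix of matching costs between the points of one shape and the warped points of the other, the symmetric sum over best matching points can be sketched as below. The function name is hypothetical, and averaging each direction over its own number of points is an assumption of this sketch.

```python
def shape_context_distance(cost):
    """Symmetric sum of best-match costs.

    cost[i][j] is the matching cost between point i of the first shape
    and (warped) point j of the second shape.
    """
    n = len(cost)
    m = len(cost[0])
    # Best match in the second shape for every point of the first shape.
    forward = sum(min(row) for row in cost) / n
    # Best match in the first shape for every point of the second shape.
    backward = sum(min(cost[i][j] for i in range(n)) for j in range(m)) / m
    return forward + backward
```

Note that the best match is taken independently for each point here, so this term does not require the one-to-one assignment used when solving for correspondences.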

In  many  applications  there  is  additional  appearance 
information  available  that  is  not  captured  by  our  notion  of 
shape,  e.g.,  the  texture  and  color  information  in  the 
grayscale  image  patches  surrounding  corresponding  points. 
The reliability of appearance information often suffers
substantially from geometric image distortions. However,
once image correspondences have been established and the
underlying 2D image transformation recovered, the distorted image
can be warped back into a normal form, thus correcting for
distortions of the image appearance.

We used a term $\mathcal{D}_{ac}(\mathcal{P}, \mathcal{Q})$ for appearance cost, defined as
the sum of squared brightness differences in Gaussian
windows around corresponding image points,

$$\mathcal{D}_{ac}(\mathcal{P}, \mathcal{Q}) = \frac{1}{n} \sum_{i=1}^{n} \sum_{\Delta \in \mathbb{Z}^2} G(\Delta) \left[ I_P(p_i + \Delta) - I_Q(T(q_{\pi(i)}) + \Delta) \right]^2, \tag{13}$$

where $I_P$ and $I_Q$ are the gray-level images corresponding to
$\mathcal{P}$ and $\mathcal{Q}$, respectively. $\Delta$ denotes some differential vector
offset and $G$ is a windowing function typically chosen to be
a Gaussian, thus putting emphasis on nearby pixels. We
thus sum over squared differences in windows around
corresponding points, scoring the weighted gray-level
similarity.

This  score  is  computed  after  the  thin  plate  spline 
transformation  T  has  been  applied  to  best  warp  the  images 
into  alignment. 

The third term $\mathcal{D}_{be}(\mathcal{P}, \mathcal{Q})$ corresponds to the "amount" of
transformation necessary to align the shapes. In the TPS case
the bending energy (9) is a natural measure (see [5]).


5.2  Choosing  Prototypes 

In  a  prototype-based  approach,  the  key  question  is:  what 
examples  shall  we  store?  Different  categories  need  different 
numbers  of  views.  For  example,  certain  handwritten  digits 
have  more  variability  than  others,  e.g.,  one  typically  sees 
more  variations  in  fours  than  in  zeros.  In  the  category  of 
3D  objects,  a  sphere  needs  only  one  view,  for  example, 
while  a  telephone  needs  several  views  to  capture  the  variety 
of  visual  appearance.  This  idea  is  related  to  the  "aspect" 
concept  as  discussed  in  [30].  We  will  now  discuss  how  we 
approach  the  problem  of  prototype  selection. 

In  the  nearest-neighbor  classifier  literature,  the  problem 
of  selecting  exemplars  is  called  editing.  Extensive  reviews  of 
nearest-neighbor  editing  methods  can  be  found  in  Ripley 
[46] and Dasarathy [12]. We have developed a novel editing
algorithm based on shape distance and $K$-medoids clustering.
$K$-medoids can be seen as a variant of $K$-means that
restricts prototype positions to data points. First a matrix of
pairwise similarities between all possible prototypes is
computed. For a given number of $K$ prototypes the
$K$-medoids algorithm then iterates two steps: 1) For a given
assignment of points to (abstract) clusters a new prototype
is selected for that cluster by minimizing the average
distance of the prototype to all elements in the cluster, and
2) given the set of prototypes, points are then reassigned to
clusters according to the nearest prototype. More formally,
denote by $c(\mathcal{P})$ the (abstract) cluster of shape $\mathcal{P}$, e.g.,
represented by some number $\{1, \ldots, k\}$ and denote by $p(c)$
the associated prototype. Thus, we have a class map


$$c : \mathcal{S}_1 \subset \mathcal{S} \to \{1, \ldots, k\} \tag{14}$$

and a prototype map

$$p : \{1, \ldots, k\} \to \mathcal{S}_2 \subset \mathcal{S}. \tag{15}$$

Here, $\mathcal{S}_1$ and $\mathcal{S}_2$ are some subsets of the set of all potential
shapes $\mathcal{S}$. Often, $\mathcal{S} = \mathcal{S}_1 = \mathcal{S}_2$. $K$-medoids proceeds by
iterating two steps:

Fig. 7. Handwritten digit recognition on the MNIST data set. Left: Test set errors of a 1-NN classifier using SSD and Shape Distance (SD) measures.
Right: Detail of performance curve for Shape Distance, including results with training set sizes of 15,000 and 20,000. Results are shown on a
semilog-x scale for K = 1, 3, 5 nearest-neighbors.


1.  group $\mathcal{S}_1$ into classes given the class prototypes $p(c)$,
and

2.  identify a representative prototype for each class
given the elements in the cluster.

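Basically, item 1 is solved by assigning each shape $\mathcal{P} \in \mathcal{S}_1$ to
the nearest prototype, thus

$$c(\mathcal{P}) = \arg\min_{k} \mathcal{D}(\mathcal{P}, p(k)). \tag{16}$$

For given classes, in item 2 new prototypes are selected
based on minimal mean dissimilarity, i.e.,

$$p(k) = \arg\min_{p \in \mathcal{S}_2} \sum_{\mathcal{P} : c(\mathcal{P}) = k} \mathcal{D}(\mathcal{P}, p). \tag{17}$$

Since both steps minimize the same cost function

$$H(c, p) = \sum_{\mathcal{P} \in \mathcal{S}_1} \mathcal{D}(\mathcal{P}, p(c(\mathcal{P}))), \tag{18}$$

the algorithm necessarily converges to a (local) minimum.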
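
The two alternating steps can be sketched on a precomputed dissimilarity matrix as follows. This is an illustrative sketch rather than our implementation: the function names are hypothetical, the initial prototypes are supplied by the caller, and candidate prototypes for a cluster are restricted to the members of that cluster.

```python
def assign_to_prototypes(dist, medoids):
    """Step 1: index of the nearest prototype for every shape."""
    return [min(range(len(medoids)), key=lambda k: dist[i][medoids[k]])
            for i in range(len(dist))]

def k_medoids(dist, medoids, max_iter=100):
    """Alternate assignment and prototype selection until nothing changes.

    dist:    n x n matrix of pairwise dissimilarities.
    medoids: initial prototype indices, one per cluster.
    Returns (prototype indices, cluster index per shape, total cost).
    """
    medoids = list(medoids)
    for _ in range(max_iter):
        assign = assign_to_prototypes(dist, medoids)
        updated = []
        for k, old in enumerate(medoids):
            members = [i for i, c in enumerate(assign) if c == k]
            if not members:
                # Keep the previous prototype if a cluster becomes empty.
                updated.append(old)
                continue
            # Step 2: the member with the smallest summed dissimilarity
            # to all elements of its cluster becomes the new prototype.
            updated.append(min(members,
                               key=lambda p: sum(dist[i][p] for i in members)))
        if updated == medoids:
            break
        medoids = updated
    assign = assign_to_prototypes(dist, medoids)
    cost = sum(dist[i][medoids[c]] for i, c in enumerate(assign))
    return medoids, assign, cost
```

Since neither step can increase the total cost, the loop stops at a local minimum; different initial prototypes may lead to different local minima.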

As with most clustering methods, with $k$-medoids one
must have a strategy for choosing $k$. We select the number
of prototypes using a greedy splitting strategy starting with
one prototype per category. We choose the cluster to split
based on the associated overall misclassification error. This
continues until the overall misclassification error has
dropped below a criterion level. In this way, the prototypes are
automatically allocated to the different object classes,
making efficient use of available resources. The application of this
procedure  to  a  set  of  views  of  3D  objects  is  explored  in 
Section  6.2  and  illustrated  in  Fig.  10. 

6  Case  Studies 
6.1  Digit  Recognition 

Here, we present results on the MNIST data set of handwritten
digits, which consists of 60,000 training and 10,000 test
digits [34]. In the experiments, we used 100 points sampled
from the Canny edges to represent each digit. When
computing the $C_{ij}$'s for the bipartite matching, we included
a term representing the dissimilarity of local tangent
angles. Specifically, we defined the matching cost as
$C_{ij} = (1 - \beta) C_{ij}^{sc} + \beta C_{ij}^{tan}$, where $C_{ij}^{sc}$ is the shape context cost,
$C_{ij}^{tan} = 0.5(1 - \cos(\theta_i - \theta_j))$ measures tangent angle dissimilarity,
and $\beta = 0.1$. For recognition, we used a $K$-NN
classifier with a distance function

$$\mathcal{D} = 1.6 \mathcal{D}_{ac} + \mathcal{D}_{sc} + 0.3 \mathcal{D}_{be}. \tag{19}$$

The weights in (19) have been optimized by a leave-one-out
procedure on a 3,000 × 3,000 subset of the training data.
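
The combined matching cost and the weighted distance (19) amount to simple arithmetic, sketched below. The function names are hypothetical; the shape context cost and the three distance terms are taken as given inputs, and tangent angles are in radians.

```python
import math

def tangent_cost(theta_i, theta_j):
    """Tangent angle dissimilarity 0.5 * (1 - cos(theta_i - theta_j)), in [0, 1]."""
    return 0.5 * (1.0 - math.cos(theta_i - theta_j))

def matching_cost(c_sc, theta_i, theta_j, beta=0.1):
    """Blend of shape context cost and tangent angle dissimilarity."""
    return (1.0 - beta) * c_sc + beta * tangent_cost(theta_i, theta_j)

def shape_distance(d_ac, d_sc, d_be):
    """Weighted sum of appearance, shape context, and bending energy terms, as in (19)."""
    return 1.6 * d_ac + d_sc + 0.3 * d_be
```

For example, two points with identical shape contexts but opposite tangent directions receive a matching cost of 0.1 under the default weight.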

On the MNIST data set nearly 30 algorithms have been
compared (http://www.research.att.com/~yann/exdb/mnist/index.html).
The lowest test set error rate published
at this time is 0.7 percent for a boosted LeNet-4 with a
training set of size 60,000 × 10 synthetic distortions per
training digit. Our error rate using 20,000 training examples
and 3-NN is 0.63 percent. The 63 errors are shown in Fig. 8.$^{4}$

As mentioned earlier, what matters in practical applications
of nearest-neighbor methods is the performance for
small $n$, and this gives us a way to compare different
similarity/distance measures. In Fig. 7 (left), our shape
distance is compared to SSD (sum of squared differences
between pixel brightness values). In Fig. 7 (right), we compare
the classification rates for different $K$.

6.2  3D  Object  Recognition 

Our  next  experiment  involves  the  20  common  household 
objects  from  the  COIL-20  database  [40].  Each  object  was 
placed  on  a  turntable  and  photographed  every  5°  for  a  total 
of  72  views  per  object.  We  prepared  our  training  sets  by 
selecting  a  number  of  equally  spaced  views  for  each  object 
and  using  the  remaining  views  for  testing.  The  matching 
algorithm is exactly the same as for digits. Recall that the
Canny  edge  detector  responds  both  to  external  and  internal 
contours,  so  the  100  sample  points  are  not  restricted  to  the 
external  boundary  of  the  silhouette. 

Fig.  9  shows  the  performance  using  1-NN  with  the 
distance  function  D  as  given  in  (19)  compared  to  a 

4.  DeCoste and Schölkopf [13] report an error rate of 0.56 percent on the
same  database  using  Virtual  Support  Vectors  (VSV)  with  the  full  training 
set  of  60,000.  VSVs  are  found  as  follows:  1)  obtain  SVs  from  the  original 
training  set  using  a  standard  SVM,  2)  subject  the  SVs  to  a  set  of  desired 
transformations  (e.g.,  translation),  3)  train  another  SVM  on  the  generated 
examples.

Fig. 8. All of the misclassified MNIST test digits using our method (63 out of 10,000). The text above each digit indicates the example number
followed by the true label and the assigned label.


straightforward  sum  of  squared  differences  (SSD).  SSD 
performs  very  well  on  this  easy  database  due  to  the  lack  of 
variation in lighting [24] (PCA just makes it faster).

The  prototype  selection  algorithm  is  illustrated  in  Fig.  10. 
As  seen,  views  are  allocated  mainly  for  more  complex 
categories with high within-class variability. The curve
marked SD-proto in Fig. 9 shows the improved classification
performance  using  this  prototype  selection  strategy  instead 
of  equally-spaced  views.  Note  that  we  obtain  a  2.4  percent 


Fig.  9.  3D  object  recognition  using  the  COIL-20  data  set.  Comparison  of 
test  set  error  for  SSD,  Shape  Distance  (SD),  and  Shape  Distance  with 
k-medoids prototypes (SD-proto) versus number of prototype views. For
SSD  and  SD,  we  varied  the  number  of  prototypes  uniformly  for  all 
objects.  For  SD-proto,  the  number  of  prototypes  per  object  depended  on 
the  within-object  variation  as  well  as  the  between-object  similarity. 


error  rate  with  an  average  of  only  four  two-dimensional 
views  for  each  three-dimensional  object,  thanks  to  the 
flexibility  provided  by  the  matching  algorithm. 

6.3  MPEG-7  Shape  Silhouette  Database 

Our  next  experiment  involves  the  MPEG-7  shape  silhouette 
database,  specifically  Core  Experiment  CE-Shape-1  part  B, 
which  measures  performance  of  similarity-based  retrieval 
[25].  The  database  consists  of  1,400  images:  70  shape 
categories,  20  images  per  category.  The  performance  is 
measured  using  the  so-called  "bullseye  test,"  in  which  each 



Fig. 10. Prototype views selected for two different 3D objects from the
COIL data set using the algorithm described in Section 5.2. With this
approach, views are allocated adaptively depending on the visual
complexity of an object with respect to viewing angle.

image is used as a query and one counts the number of
correct  images  in  the  top  40  matches. 
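
The counting procedure of the bullseye test can be sketched as follows. The function name is hypothetical, and two conventions are assumed here: the query itself is included among the ranked images, and the count is normalized by the largest number of correct images that could be retrieved, so that a perfect result scores 1.

```python
def bullseye_score(dist, labels, top=40):
    """Fraction of same-category images found among the top-ranked matches.

    dist:   n x n matrix of pairwise distances between all images.
    labels: category label of each image.
    top:    number of best matches inspected per query.
    """
    n = len(labels)
    hits = 0
    possible = 0
    for q in range(n):
        ranked = sorted(range(n), key=lambda j: dist[q][j])[:top]
        hits += sum(1 for j in ranked if labels[j] == labels[q])
        possible += sum(1 for lab in labels if lab == labels[q])
    return hits / possible
```

With 20 images per category and the top 40 matches inspected, each query can contribute at most 20 correct images under these conventions.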

As this experiment involves intricate shapes we increased
the number of samples from 100 to 300. In some
categories, the shapes appear rotated and flipped, which we
address using a modified distance function. The distance
$\text{dist}(Q, R)$ between a reference shape $R$ and a query shape
$Q$ is defined as

$$\text{dist}(Q, R) = \min\{\text{dist}(Q, R^a), \text{dist}(Q, R^b), \text{dist}(Q, R^c)\},$$

where $R^a$, $R^b$, and $R^c$ denote three versions of $R$: unchanged,
vertically flipped, and horizontally flipped.
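
A minimal sketch of this modified distance, for shapes given as lists of (x, y) points, is shown below. The function names are hypothetical. It is assumed that a vertical flip negates the y-coordinate and a horizontal flip negates the x-coordinate, and that the underlying distance `dist` is invariant to translation, so that the position of the mirror axis does not matter.

```python
def flip_vertical(points):
    """Mirror about the horizontal axis (assumed meaning of a vertical flip)."""
    return [(x, -y) for x, y in points]

def flip_horizontal(points):
    """Mirror about the vertical axis (assumed meaning of a horizontal flip)."""
    return [(-x, y) for x, y in points]

def flip_invariant_distance(query, reference, dist):
    """Minimum of dist over the unchanged and the two flipped references."""
    variants = (reference, flip_vertical(reference), flip_horizontal(reference))
    return min(dist(query, r) for r in variants)
```

The cost is three evaluations of the underlying distance per query and reference pair.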

With these changes in place but otherwise using the
same approach as in the MNIST digit experiments, we
obtain a retrieval rate of 76.51 percent. The best previously
published performance is that of Latecki et al. [33],
with a retrieval rate of 76.45 percent, followed by Mokhtarian
et al. at 75.44 percent.

6.4  Trademark  Retrieval 

Trademarks are visually often best described by their shape
information, and, in many cases, shape provides the only
source of information. The automatic identification of
trademark infringement has interesting industrial applications,
since with the current state of the art trademarks are
broadly categorized according to the Vienna code, and then
manually classified according to their perceptual similarity.
Even though shape context matching does not provide a full
solution to the trademark similarity problem (other potential
cues are text and texture), it still serves well to illustrate
the capability of our approach to capture the essence of
shape similarity. In Fig. 12, we depict retrieval results for a
database of 300 trademarks. In this experiment, we relied on
an affine transformation model as given by (3), and as in the
previous case, we used 300 sample points.

We experimented with eight different query trademarks
for each of which the database contained at least one potential
infringement. We depict the top four hits as well as their
similarity score. It is clearly seen that the potential infringements
are easily detected and appear as most similar in the
top ranks despite substantial variation of the actual shapes. It
has been manually verified that no visually similar trademark
has been missed by the algorithm.


[Figure 12 image: each query trademark is shown next to its top-ranked hits, numbered 1 to 3, with the score of each hit.]

Fig. 12. Trademark retrieval results based on a database of 300 different
real-world trademarks. We used an affine transformation model and a
weighted combination of shape context similarity $\mathcal{D}_{sc}$ and the sum over
local tangent orientation differences.

7  Conclusion 

We  have  presented  a  new  approach  to  shape  matching.  A  key 
characteristic  of  our  approach  is  the  estimation  of  shape 
similarity  and  correspondences  based  on  a  novel  descriptor, 
the  shape  context.  Our  approach  is  simple  and  easy  to  apply, 
yet  provides  a  rich  descriptor  for  point  sets  that  greatly 
improves  point  set  registration,  shape  matching  and  shape 
recognition. In our experiments, we have demonstrated

Fig. 13. Kimia data set: each row shows instances of a different object
category.  Performance  is  measured  by  the  number  of  closest  matches 
with  the  correct  category  label.  Note  that  several  of  the  categories  require 
rotation  invariant  matching  for  effective  recognition.  All  of  the  1st  ranked 
closest  matches  were  correct  using  our  method.  Of  the  2nd  ranked 
matches,  one  error  occurred  in  1  versus  8.  In  the  3rd  ranked  matches, 
confusions  arose  from  2  versus  8,  8  versus  1,  and  15  versus  17. 


invariance to several common image transformations, including
significant 3D rotations of real-world objects.


Appendix  A 

Complete  Rotation  Invariant  Recognition 

In  this  appendix,  we  demonstrate  the  use  of  the  relative 
frame  in  our  approach  as  a  means  of  obtaining  complete 
rotation  invariance.  To  demonstrate  this  idea,  we  have  used 
the  database  provided  by  Sharvit  et  al.  [53]  shown  in  Fig.  13. 
In  this  experiment,  we  used  n  =  100  sample  points  and,  as 
mentioned  above,  we  used  the  relative  frame  (Section  3.3) 
when  computing  the  shape  contexts.  We  used  five  bins  for 
$\log(r)$ over the range $0.125\alpha$ to $2\alpha$ and 12 equally spaced
radial bins in these and all other experiments in this paper.
No transformation model at all was used. As a similarity
score, we used the matching cost function $\sum_i C_{i, \pi(i)}$ after
one  iteration  with  no  transformation  step.  Thus,  this 
experiment  is  specifically  designed  solely  to  evaluate  the 
power  of  the  shape  descriptor  in  the  face  of  rotation. 
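
For concreteness, a log-polar histogram with these bin counts can be sketched as follows. This is an illustrative sketch, not our implementation. The function name and arguments are hypothetical, and the bin convention is an assumption: five log-spaced radial edges from $0.125\alpha$ to $2\alpha$ are used, the innermost bin collects all points closer than $0.125\alpha$, and points farther than $2\alpha$ are ignored. The argument `ref_angle` stands for the tangent direction at the reference point when the relative frame is used, and is zero for the absolute frame.

```python
import math

def shape_context(points, index, alpha, n_r=5, n_theta=12,
                  r_inner=0.125, r_outer=2.0, ref_angle=0.0):
    """Log-polar histogram of the other points relative to points[index].

    alpha:     scale of the point set (e.g., mean pairwise distance).
    ref_angle: direction treated as the positive x-axis, in radians.
    Returns an n_r x n_theta table of counts.
    """
    # Log-spaced upper edges of the radial bins, in units of alpha.
    edges = [alpha * r_inner * (r_outer / r_inner) ** (k / (n_r - 1))
             for k in range(n_r)]
    hist = [[0] * n_theta for _ in range(n_r)]
    px, py = points[index]
    for j, (qx, qy) in enumerate(points):
        if j == index:
            continue
        r = math.hypot(qx - px, qy - py)
        r_bin = next((k for k, e in enumerate(edges) if r < e), None)
        if r_bin is None:
            continue  # farther than the outer radius
        theta = (math.atan2(qy - py, qx - px) - ref_angle) % (2 * math.pi)
        t_bin = min(int(theta / (2 * math.pi / n_theta)), n_theta - 1)
        hist[r_bin][t_bin] += 1
    return hist
```

Passing the local tangent angle as `ref_angle` rotates the angular bins with the shape, which is the mechanism behind the rotation invariance examined in this appendix.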

In  [53]  and  [17],  the  authors  summarize  their  results  on 
this  data  set  by  stating  the  number  of  1st,  2nd,  and  3rd 
nearest-neighbors  that  fall  into  the  correct  category.  Our 
results  are  25/25,  24/25,  22/25.  In  [53]  and  [17],  the  results 
quoted  are  23/25,  21/25,  20/25  and  25/25,  21/25,  19/25, 
respectively. 


Appendix  B 

Sampling  Considerations 

In  our  approach,  a  shape  is  represented  by  a  set  of  sample 
points  drawn  from  the  internal  and  external  contours  of  an 
object.  Operationally,  one  runs  an  edge  detector  on  the 


gray-scale  image  and  selects  a  subset  of  the  edge  pixels 
found.  The  selection  could  be  uniformly  at  random,  but  we 
have  found  it  to  be  advantageous  to  ensure  that  the  sample 
points  have  a  certain  minimum  distance  between  them  as 
this  makes  sure  that  the  sampling  along  the  contours  is 
somewhat  uniform.  (This  corresponds  to  sampling  from  a 
point  process  which  is  a  hard-core  model  [45].) 
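
One simple way to enforce a minimum distance between samples is sequential rejection: candidates are visited in random order and kept only if they are far enough from all samples kept so far. The sketch below illustrates this idea; it is one possible procedure and an assumption on our part, and the function name, the `min_dist` threshold, and the fixed `seed` are hypothetical.

```python
import math
import random

def sample_with_min_distance(edge_pixels, n_samples, min_dist, seed=0):
    """Pick up to n_samples edge pixels that are pairwise at least min_dist apart.

    edge_pixels: iterable of (x, y) edge locations.
    May return fewer than n_samples points if min_dist is too large.
    """
    rng = random.Random(seed)
    candidates = list(edge_pixels)
    rng.shuffle(candidates)
    chosen = []
    for p in candidates:
        if all(math.dist(p, q) >= min_dist for q in chosen):
            chosen.append(p)
            if len(chosen) == n_samples:
                break
    return chosen
```

If fewer points than requested are returned, the threshold can be lowered and the procedure repeated.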

Since  the  sample  points  are  drawn  randomly  and 
independently  from  the  two  shapes,  there  is  inevitably 
jitter  noise  in  the  output  of  the  matching  algorithm  which 
finds  correspondences  between  these  two  sets  of  sample 
points.  However,  when  the  transformation  between  the 
shapes  is  estimated  as  a  regularized  thin  plate  spline,  the 
effect of this jitter is smoothed away.